Example: Patient Satisfaction

A hospital administrator wished to study the relation between patient satisfaction (a measured index score between 0 and 100) and the patient’s age (measured in years), severity of illness (an index), anxiety level (an index), and sex (coded as 0=male, 1=female). Larger values of the index variables correspond to greater satisfaction, more severe illness, and more anxiety, respectively. The administrator randomly selected 46 patients and collected the data, which appear in the text file patientsatisfaction.txt.

Read in the data and take a quick look:

patsatdat <- read.table("patientsatisfaction.txt", header=TRUE)
kable(head(patsatdat))
satis.index age ill.sev.index anx.index sex
         48  50            51       2.3   0
         57  36            46       2.3   0
         66  40            48       2.2   0
         70  41            44       1.8   0
         89  28            43       1.8   1
         36  49            54       2.9   0

What is different about this compared to the problems we were seeing in Chapters 2-4?

Statisticians use regression to build models to investigate the nature of such relationships amongst variables. When there are multiple predictors being simultaneously considered, the technique is known as multiple regression modeling.

The goals of an analysis of these data are to:

  1. Develop a model that explains the relationship between these predictors and patient satisfaction as best we can. (Note what we just said there … we will develop a model through a process of assessing the “goodness” of a model. The model is not determined based on some designed experiment!)
  2. Use the model to make inferences about the population that the selected sample of patients represents.

The first item in the list above highlights the importance of knowing the form of the regression model being fitted, because we may try several different models for a given data set. The basic form of a main effects multiple linear regression (MLR) model with \(k\) predictors is:

\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_k X_k + \varepsilon\]

In the present context, this model is given by

\[\textrm{Patient Satis Score} = \beta_0 + \beta_1 (\textrm{Age}) + \beta_2 (\textrm{Illness Severity Index}) + \beta_3 (\textrm{Anxiety Index}) + \beta_4 (\textrm{Sex}) + \varepsilon\]

EDA: Correlation analysis

While it is still important to observe the distribution of the individual variables in the study (like we have done in prior EDAs), the main focus of a regression analysis is to study associations between variables. This can be effectively done via a scatterplot matrix (available in GGally with the ggpairs() function), which displays the pairwise associations that exist in the sample data, along with Pearson correlation coefficients:

ggpairs(patsatdat, columns = c(2:5,1))

You’ll note here that we went out of our way to list the columns we wish to plot. There are only 5 columns in the data, but we want the response variable (patient satisfaction, the first column) to be listed last. We do this so that in the bottom row of plots, the response variable is on the \(y\)-axis, matching our typical interpretation.

Note, another option in the GGally package is `ggscatmat()`, which creates a similar plot but is not as useful for categorical inputs. We may use that function at some point as well.

You may note the uselessness of the sex variable in the above plots. This is because of how it is coded in the dataset (with a 1 or 0). By default R treats it as an integer/number, so it builds a scatterplot in the normal way. We know the context (0=male, 1=female), so it makes sense to use other visual displays to look for an association between sex and patient satisfaction. In the code below, we first make a copy of the patient satisfaction data and convert the sex variable to a factor variable (i.e., categorical); then we look at the ggpairs output.

patsatdat2 <- patsatdat %>%
  mutate(sex = factor(sex, c(0,1), labels=c("Male", "Female")))
ggpairs(patsatdat2, columns=c(2,3,4,5,1))

Note the similarity of the two outputs, but now for the categorical predictor we see side-by-side boxplots and faceted histograms, which are more useful for assessing any association between a categorical predictor and a numeric response.

Correlation between two quantitative variables measures the amount of linear relationship between the variables. A correlation is typically denoted with an \(r\) (e.g., \(r=-0.79\) for the satis.index and age variables). As a reminder, any correlation is bounded between \(-1\) and \(1\): the sign gives the direction of the bivariate linear relationship, and the distance from 0 gives its strength (thus age and satis.index have a moderately strong negative relationship).
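These correlations can also be computed directly with the `cor()` function. A quick sketch, assuming patsatdat has been read in as above:

```r
# Pearson correlation between satisfaction and age
# (should match the value shown in the ggpairs output)
cor(patsatdat$satis.index, patsatdat$age)

# All pairwise correlations among the variables, rounded for readability
round(cor(patsatdat), 2)
```

The matrix version is handy for scanning which predictors are most strongly associated with the response (and with each other).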

What are some things to notice in the above EDA?

  • Do any of the variables appear to be heavily skewed, or are there any unusual observations (outliers) to be concerned about?
  • Satisfaction appears to be negatively correlated with age, illness and anxiety levels. Does this make intuitive sense?
  • Does the correlation between satisfaction index and sex make any sense? Is it meaningful?
  • Does it matter that it appears that the predictors look to be related to each other?

Over the coming weeks we will answer all of these (and more!) questions by covering various statistical methods.

Now, how do we determine how to fit a model of the form above to these data? For that, we consider a simple dataset with just one predictor variable…


Simple Linear Regression Example

Manatees are large, gentle sea creatures that live along the Florida coast. The number of manatees killed by motorboats in the state of Florida has been continually monitored since 1974 by the Florida Fish and Wildlife Conservation Commission. The number of motorboats registered within the state since 1977 is available from the state’s Department of Motor Vehicles. The original data are attributed to Elementary Statistics, 9th edition by Mario Triola but have since been updated. The data are in the file manatee.csv on the Canvas site.

Below is a scatter plot of the raw data. Note our creativity in choice of a plotting symbol:

Visually it looks as if there is an upward-trending line (more boats registered, more manatees killed). Suppose we decide to fit the following model to the data:

\[Y = \beta_0 + \beta_1 X + \varepsilon\]

where \(Y\) is the number of manatees killed and \(X\) is the number of boats registered in a given year.

How do we estimate the coefficients \(\beta_0\) and \(\beta_1\)? Let’s consider different line “fits” to the data in the following interactive display. You may also see details of each data point by hovering your cursor over each point:

What choice appears to be the best? It’s hard to say. For a bit more focus, let’s consider just two specific competing fitted lines:

The plot above includes a depiction of the model “error” for each point under each of the two fitted lines, calculated as the observed number of manatees killed compared to the fitted line (which produces a model-based predicted number of manatees killed). Ideally, we want a line with smaller errors overall. We find the best such model using the method of least squares, where we minimize the quantity

\[\begin{eqnarray*} RSS &=& \sum_{i=1}^n \left(\textrm{Error at point } i\right)^2 \\ &=& \sum_{i=1}^n \left(e_i\right)^2 \\ &=& \sum_{i=1}^n \left(Y_i - \hat{Y}_i\right)^2 \\ &=& \sum_{i=1}^n \left(Y_i - b_0 - b_1 X_i\right)^2 \end{eqnarray*}\]
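Minimizing RSS is a calculus exercise: setting the partial derivatives of RSS with respect to \(b_0\) and \(b_1\) equal to zero and solving the resulting equations yields the least squares estimates

\[\begin{eqnarray*} b_1 &=& \frac{\sum_{i=1}^n \left(X_i - \bar{X}\right)\left(Y_i - \bar{Y}\right)}{\sum_{i=1}^n \left(X_i - \bar{X}\right)^2} \\ b_0 &=& \bar{Y} - b_1 \bar{X} \end{eqnarray*}\]

These are the values R computes for us behind the scenes, as we will see shortly.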

For the two models fitted in the display above, the RSS for each is

Group     RSS
Model 1   4398.861
Model 2   3222.810

So Model 2 is a better fit than Model 1 (you can see that visually as well) … but is it the best?
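A quick way to compare candidate lines yourself is a small helper function. This is a sketch, assuming the manatee data frame has been read in with the columns BoatsRegistered and ManateesKilled used below:

```r
# RSS for a candidate line y = b0 + b1*x on the manatee data
rss <- function(b0, b1) {
  with(manatee, sum((ManateesKilled - b0 - b1 * BoatsRegistered)^2))
}

# Evaluate any two candidate (b0, b1) pairs; the smaller RSS is the better fit
```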

To determine the best fitting model overall, we need to use calculus with the above RSS function and find the optimal values of \(b_0\) and \(b_1\). In R, we do this with the lm() function. Consider the following:

lm(ManateesKilled ~ BoatsRegistered, data=manatee)
## 
## Call:
## lm(formula = ManateesKilled ~ BoatsRegistered, data = manatee)
## 
## Coefficients:
##     (Intercept)  BoatsRegistered  
##        -43.5956           0.1326

Thus, for each additional boat registered, we expect the number of manatees killed by motorboats to increase by 0.1326. Or to phrase it another way, for every 100 additional boats registered, we can expect approximately 13 more manatees, on average, to be killed by motorboats.

Note the intercept here is conceptually meaningless: it corresponds to the expected number of manatees killed by motorboats when zero motorboats are registered in the state of Florida. (This is a common phenomenon of regression – the intercept often tends to have no contextual interpretation.)
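To reuse the fit, it helps to store the model object and extract pieces from it. A sketch, assuming the manatee data frame is loaded; the registration count of 700,000 is just a hypothetical value for illustration:

```r
# Store the fitted model object for later use
slr.model <- lm(ManateesKilled ~ BoatsRegistered, data = manatee)

coef(slr.model)   # the intercept and slope, as in the output above

# Model-based prediction at a hypothetical number of registered boats
predict(slr.model, newdata = data.frame(BoatsRegistered = 700000))
```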


Multiple regression example: Patient Satisfaction

We will fit the multiple regression model

\[\textrm{Patient Satis Score} = \beta_0 + \beta_1 (\textrm{Age}) + \beta_2 (\textrm{Illness Severity Index}) + \beta_3 (\textrm{Anxiety Index}) + \beta_4 (\textrm{Sex}) + \varepsilon\]

to the data using R and the function lm(). For now, we will forgo checking the assumptions (we’ll see this very soon!) because the validity of the assumptions underlying the model mainly pertains to using the model for statistical inference purposes (Chapter 6).

mr.model1 <- lm(satis.index ~ age + ill.sev.index + anx.index + sex, data=patsatdat)
summary(mr.model1)
## 
## Call:
## lm(formula = satis.index ~ age + ill.sev.index + anx.index + 
##     sex, data = patsatdat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.347  -6.417   0.520   8.374  17.165 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   158.49433   18.36268   8.631 9.16e-11 ***
## age            -1.14173    0.21951  -5.201 5.86e-06 ***
## ill.sev.index  -0.44201    0.49793  -0.888   0.3799    
## anx.index     -13.46700    7.23154  -1.862   0.0697 .  
## sex            -0.01183    3.04192  -0.004   0.9969    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.18 on 41 degrees of freedom
## Multiple R-squared:  0.6822, Adjusted R-squared:  0.6512 
## F-statistic:    22 on 4 and 41 DF,  p-value: 9.332e-10

Several interpretive things to take note of here:

In the text, we discuss the practical challenges of interpreting the \(\beta\)-parameter estimates when a model is fit to observational data. Nonetheless, here is an interpretation of them in the present context; we will discuss the practical challenges in class:
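The estimates can be pulled directly from the fitted object. As one example interpretation, taken from the output above: holding illness severity, anxiety level, and sex fixed, each additional year of age is associated with a decrease of about 1.14 points in the satisfaction index, on average.

```r
# Extract the estimated coefficients from the fitted multiple regression
round(coef(mr.model1), 3)
```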

Now, what are some of the issues with practical interpretation of these parameter estimates?